Correlation of Gene Function Annotation Lists through Enhanced Spearman and Kendall Measures
Abstract
Gene function annotations are paramount in bioinformatics, and computational methods that can predict them make a fundamental contribution. Several machine learning algorithms for this purpose are available today, although their relevant parameters might strongly influence the output list of predicted annotations. Here, we propose a method to evaluate this issue by introducing two list correlation measures, based on the Spearman rank correlation coefficient and the Kendall rank distance respectively, which quantify the level of similarity between ordered annotation lists. We show the application of these measures to Gene Ontology annotation datasets, which reveals interesting patterns between predicted annotation lists and lets us draw some conclusions about the prediction algorithms used.

1 Scientific background

In bioinformatics, a controlled gene function annotation is the association of a gene with a controlled term that represents a functional feature; such a term can be part of a terminology, or structured in an ontology such as the Gene Ontology (GO) [1]. Thus, the annotation states that the gene has the functional feature represented by the annotation term. For instance, the gene VMA9 is annotated to the biological function transmembrane transport.

Despite their biological significance, available annotations suffer from some issues (e.g. erroneous or missing annotations) [2]. Thus, computational methods and software tools able to produce ranked lists of reliably predicted annotations are an excellent contribution to the field.

In the past, we designed and developed some algorithms in this field. We started from a state-of-the-art algorithm based on truncated Singular Value Decomposition (tSVD) [3] and developed some variants [4]. Then, in [5] we designed an algorithm to choose the best truncation level for the tSVD, and in [6] we designed and tested some topic modeling techniques. All these methods involve key parameters that influence the output.
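For orientation, the truncated SVD named above has the following textbook form. This is a sketch that assumes a binary gene-by-term annotation matrix A as input; it is not necessarily the exact formulation used in [3]. The symbols U_k, Σ_k, V_k and the truncation level k are notation introduced here for illustration.

```latex
A = U \Sigma V^{T}, \qquad
\tilde{A}_k = U_k \Sigma_k V_k^{T}, \qquad 1 \le k \le \operatorname{rank}(A)
```

Here U_k and V_k keep the first k columns of U and V, and Σ_k keeps the k largest singular values. The truncation level k is an example of a key parameter: different values of k give different reconstructed matrices, and therefore potentially different rankings of candidate annotations.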
To understand how the resulting annotation lists change according to variations of these key parameters, a similarity measure that compares different output lists is required. To accomplish this, currently the most useful and consistent measures are the Spearman rank correlation coefficient [7] and the Kendall rank distance [8]. Here, we start from a recent work by Ciceri et al. [9] to develop new weighted correlation metrics that compare ordered annotation lists.

The remainder of this paper is organized as follows. Section 2 explains the prediction of gene function annotations and introduces the Spearman and Kendall measure variants that we developed for the comparison of ordered annotation lists. Section 3 shows and discusses some significant test results of the proposed measure variants, while Section 4 concludes.

Proceedings of CIBB 2014

2 Material and methods

Let A = [a_ij] be an m × n matrix, where each row i corresponds to a gene and each column j corresponds to a functional feature term of a terminology or ontology, with a_ij = 1 if gene i is annotated to feature term j, and a_ij = 0 otherwise. Let θ be a fixed threshold value and suppose that a prediction algorithm processes the matrix A to produce an output matrix Ã, with the same dimensions as A, where each value ã_ij represents the likelihood of the annotation of gene i to feature j. Thus, a high ã_ij value indicates that gene i is likely to be associated with feature j.
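Ranking the candidate annotations by their predicted likelihood values yields one ordered list per parameter setting, and two such lists can be compared with the standard measures cited above. The following is a minimal sketch of the textbook, unweighted Spearman coefficient and Kendall rank distance, not the weighted variants this paper develops. It assumes both lists contain exactly the same items with no ties or duplicates; the function names and the gene and term labels are hypothetical.

```python
from itertools import combinations


def ranks(ordered_items):
    """Map each item to its 1-based position in an ordered list."""
    return {item: pos for pos, item in enumerate(ordered_items, start=1)}


def spearman_rho(list_a, list_b):
    """Spearman rank correlation of two orderings of the same items (no ties)."""
    ra, rb = ranks(list_a), ranks(list_b)
    if set(ra) != set(rb):
        raise ValueError("both lists must contain the same items")
    n = len(ra)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences over all items.
    d2 = sum((ra[x] - rb[x]) ** 2 for x in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_distance(list_a, list_b):
    """Kendall rank distance: number of item pairs the two lists order differently."""
    ra, rb = ranks(list_a), ranks(list_b)
    if set(ra) != set(rb):
        raise ValueError("both lists must contain the same items")
    # A pair is discordant when its relative order differs between the lists.
    return sum(
        1
        for x, y in combinations(list_a, 2)
        if (ra[x] - ra[y]) * (rb[x] - rb[y]) < 0
    )


# Hypothetical (gene, term) annotations ranked under two parameter settings.
list_k1 = [("geneA", "term1"), ("geneB", "term2"), ("geneC", "term1"), ("geneD", "term3")]
list_k2 = [("geneB", "term2"), ("geneA", "term1"), ("geneC", "term1"), ("geneD", "term3")]

print(spearman_rho(list_k1, list_k2))      # 0.8
print(kendall_distance(list_k1, list_k2))  # 1
```

A Spearman value of 1 (or a Kendall distance of 0) means the two parameter settings rank the annotations identically, while a fully reversed list gives a Spearman value of -1 and the maximum Kendall distance of n(n-1)/2.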
An annotation list, ordered according to the ã_ij values, is finally defined, and each annotation 〈gene_i, feature_j, ã_ij〉 is classified in one of the following categories:

Annotation Confirmed (AC): a_ij = 1 ∧ ã_ij > θ (similar to True Positive, TP)
Annotation Predicted (AP): a_ij = 0 ∧ ã_ij > θ (similar to False Positive, FP)
Non-Annotation Confirmed (NAC): a_ij = 0 ∧ ã_ij ≤ θ (similar to True Negative, TN)
Annotation to be Reviewed (AR): a_ij = 1 ∧ ã_ij ≤ θ (similar to False Negative, FN)

According to these categories, two annotation sublists are defined: an AP_list (Annotation Predicted list) and a NAC_list (Non-Annotation Confirmed list), which contain the annotations from the original list that were classified in the AP or NAC class, respectively.

Furthermore, these four categories are used to build the Receiver Operating Characteristic (ROC) curve, a graphical plot depicting the performance of a binary classifier system for different discrimination threshold values. As in its original definition, which uses TP_rate and FP_rate, our ROC curve depicts the trade-off between the AC_rate and AP_rate, where: